A simple stochastic model for the evolution of protein lengths 
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We analyse a simple discrete-time stochastic process for the theoretical modeling of the evolution of 
protein lengths. At every step of the process a new protein is produced as a modification of one of the 
proteins already existing and its length is assumed to be random variable which depends only on the 
length of the originating protein. Thus a Random Recursive Trees (RRT) is produced over the natural 
integers. If (quasi) scale invariance is assumed, the length distribution in a single history tends to a 
lognormal form with a specific signature of the deviations from exact gaussianity. Comparison with the 
very large SIMAP protein database shows good agreement. 



INTRODUCTION 

Nowadays, it is well established that the great variety 
of proteins in biological systems has been produced dur- 
ing the course of evolution by means of gene mutations 
that take effect at the coding level • The main mech- 
anisms are: duplication of genome segments that con- 
tain sequences coding for one o more protein domains 
El 0, Si ; divergence of the duplicated sequences by 
insertion, deletion and substitution of one or more base 
pairs |@, [Tol. [TTL [l2h ; domain rearrangements, such as 
gene fusions and gene fission fl3l 141. domain recombi- 
nation 1 15, 1(J, gene shuffling (recombination between 



dissimilar genes) [fl7h and domain insertions and dele- 
tions Mia. [19(1 . By means of these microscopic mecha- 
nisms, iterated a huge number of times throughout the 
ages of evolution, an initial protein population, most 
likely very small and poorly assorted, has been enor- 
mously increased to the present very large number and 
complex variety. 

A valuable framework for the effective modeling of the 
evolutions of genes and proteins could be provided by 
stochastic processes. In the most general formulation, 
one should take into account the complex organization of 
biological systems into independent organisms grouped 
in turn into species, genera and kingdoms, as well as the 
complicated effects of natural selection. However, since 
all evolution mechanisms generate new biological mate- 
rial by means of modifications of the biological material 
already existing, in the case of proteins we may imagine 
a simpler, more abstract discrete-time stochastic process 
over the space of all amino acid sequences such that at 
each time step t = 1,2,3,... a new protein is generated 
with some prescribed random mechanism from the set 
of proteins already existing. Clearly the dicrete time of 
the model has nothing to do with the time of the true 
biological evolution process, except that it is a (almost) 
monotonically increasing function of the latter, at least 
on time scales large enough (great mass extinctions cor- 
respond most likely to periods when this monotonicity is 
lost). Moreover, a single time step in the process would 
correspond to some averaging over a multitude of differ- 



ent effects, both at microscopic, or biochemical level, and 
at the macroscopic level of the selection-based evolution 
mechanisms. 

This abstract stochastic process would be specified by 
Pr(pt+i |{p}t)> that is the conditional probability that the 
t + 1 protein has the amino acid sequence pt+i, given 
that there is already a set {p} t = {pi,pi, ■ ■ - Pt} of dis- 
tinct proteins at time t. In principle, Pr(p t+ i|{p} t ) might 
embody the effects of many, if not all, of the compli- 
cated biochemical and evolutional mechanisms alluded 
above and should depend explicitly on time. Moreover, 
the initial configuration of proteins could be assumed to 
coincide with the actual set of distinct amino acid se- 
quences present in nature at some moment in the distant 
past, when their total number was much smaller than 
the present one. In any case, a huge amount of informa- 
tion is required for a complete specification of the model 
endowed with detailed predictive power. On the other 
hand, all what we may hope to reproduce in reasonably 
simple terms, to a large extent independent on the de- 
tails of the model, are some broad characteristics of the 
distribution of protein currently observed. This is indeed 
our main working hypothesis, based on the fact that the 
very large number of proteins in the SIMAP database do 
show simple universal properties 1I22I l23h . Then we may 
rely on the basic universality property, typical of a wide 
class of stochastic processes, that is the capability to for- 
get the details of the initial transient regime and to re- 
lax toward a statistical equilibrium or quasi-equilibrium 
state which depends only on very general features of the 
conditional probability Pr(pt+i|{p}t) an d is character- 
ized by few, weakly time-dependent "macroscopic" pa- 
rameters. 



A STOCHASTIC PROCESS FOR PROTEIN LENGTHS 

In the present work we concentrate our attention on 
the distribution of protein lengths, that is the observed 
frequency of proteins with a specific number of amino 
acids over the set of all known proteins. Thus we can 
consider only the protein length as the random vari- 
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able of the stochastic process. By definition this ran- 
dom length takes values in the natural numbers and we 
denote it with the symbol £. We also observe that by 
costruction all the proteins produced in the process can 
be ordered according to the time of production, starting 
from t = Nq, with N the number of distinct proteins in 
the initial configuration, and arriving to t = N, with N 
of the order of the number of distinct proteins that ex- 
ist now in nature, which is of order 10 7 or more. The 
statistical dynamics of the process is fully determined by 
Pi(£t+i\£t,£t-i, ■ ■ ■ ,£i), that is the conditional probabil- 
ity that the t + 1 protein has length £t+i, given that the 
preceeding proteins have the indicated lengths. This con- 
ditional probability could depend explicitly on the formal 
time t. 

As already discussed above in a more general con- 
text, the detailed biological mechanisms that constrain 
Pr(£ t +i\£t,£t-i, ■ ■ ■ ,£i) are far too complex to be ex- 
plicitly incorporated in the model. Therefore, we shall 
make simple and workable assumptions on the condi- 
tional probability relying, in practice, on some sort of 
central limit theorem for the probability that a protein 
taken at random from the state of a very long random 
process has a certain length. 

As a first simplifying assumption on the conditional 
probability Pi{£ t+ i\£t,£t-i, . . . ,£ ) we make that of lo- 
cality. That is, we assume that a given protein length can 
be produced from a preceding length independently on 
all the other lengths already produced. Hence we can 
write 

t 

Pr(£ t+1 \£ u £ t ^ 1 ,...,£ 1 )=Y,<lsW s (£ t+1 \£ s ) (1) 

s=l 

where the nonnegative weights q s are properly normal- 
ized, Y?s=iQs = 1> an d W s {£\£') can be interpreted 
as Markovian transition probabilities. In the absence 
of any other information, one would assume equal a 
priori probabilities among the different proteins, that 
is g s = l/t and W s {£\£') = W{£\£') with no explicit 
s— dependence. This might appear in conflict, however, 
with the global changes of ecosystems as well as with 
the complex organization of biological systems in king- 
doms and species (which suggests that all proteins ex- 
isting nowadays can be roughtly divided into subsets of 
similar proteins having almost independent evolutional 
histories, as least not too far in the past). We may take 
this into account by restricting the predictions of our 
stochastic process to suitable chosen subsets of the pro- 
teins of the SIMAP database, according, for instance, to 
given kingdoms. Moreover, we can neglect, on average, 
the global changes of ecosystems by placing the start of 
the process not too deep in the past. Alltogether let us 
assume that 

1 * 

Pi(£ t+1 \£ t ,£t~u ■■■,£!) = -^W(£ t+1 \£ s ) (2) 

s=l 



Our stochastic process now differs from a random walk 
on the natural integers only because at each step anyone 
of the already existing lengths, rather than only the last 
generated one, may serve as starting point for a jump to a 
new length. We are therefore dealing with the so-called 
Random Recursive Tree (RRT) [0, yD (more precisely, a 
random recursive forest) embedded by W{£\£') into the 
natural integers. It follows that the probability P(£,t) 
that the t— th protein has exactly length £ satisfies the 
non-Markovian evolution equation 

CO 

P{i,t+l) = J2w{i\i')Q{i',t) (3) 

v=i 

where Q{£,i) is the average length distribution (that is 
the mean fraction of proteins of length £) at time t and 
therefore evolves as 

(t + 1) Q{£, t + l)=t Q{£, t) + P{£, t + 1) (4) 

Together eqs. ([3]) and Q define a stochastic process with 
memory and should be compared to the Markov chain 
recursion for a simple Random Walk (RW) on all possible 
lengths, which would read instead 

CO 

p rw (^ + i) = j2 w w i ') pRW ( £l ' t ) (5) 

t'-l 

without any memory of the past. We still do have to 
make some choice on the explicit form of W (£{£'), in 
which case our stochastic process could be quite easily 
simulated on a computer. We expect, however, that the 
large time asymptotic regime of the process depends only 
on very general features of functional form of W(£\£'), 
again thanks to the universality hypothesis which has its 
roots in the law of large numbers. In any case, to in- 
vestigate the statistics of the produced lengths, we need 
beforehand some useful properties and formal manipu- 
lations valid for any W{£\£'). 

We recall first of all that by definition the nonnegative 
numbers W{£\£') satisfy the normalization condition: 

CO 

Y J W{£\£') = \ 

i=i 

These transition probabilities are the elements of a ma- 
trix W, the so-called stochastic matrix in the case of 
Markov processes. Without loss of generality, we may 
take W to be ergodic, that is such that any finite length 
can be produced after a suitable number of steps starting 
from any other finite length. 

Next we can exploit the linearity of eq. © to simplify 
the choice of initial conditions for P(£,t). As already 
stated above, the process is assumed to start with N 
distinct proteins, which we may take to have n distinct 
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lengths £j, j = 1,2, ... ,n, each repeated rij times so that 
. rij = N . This defines the initial length distribution 



1 n 

Q{l,N ) = —Y,n 3 6 U] 



3=1 



when i\T was the total number of distinct proteins. As 
the process may start from anyone of these initial pro- 
teins with equal probability 1/N , we may regard nj/N 
as probability that the process starts exactly from the 
length £j . Therefore the solution of eq. (O can be written 



3=1 



(6) 



J^P{l,t-N + l\l')Q{l',N ) 



where P(£,t\£') is the special solution that is concen- 
trated on the arbitrary value £' at t = 1, that is 
P{£, = Sw Similarly we have 

CO 

Q(£,t) = ^Q(l,t-N D + l\l')Q(l',N ) (7) 

where Q(£,t\£') is the solution of eq. (0]) specialized to 
P(£,t\£'), that is 



Q{£,t\l l ) = -Y j P{l,s\£ l ) 



(8) 



Clearly £' is the length of a specific protein which plays 
the role of root for the RRT, while the complete process 
is a forest of RRT's each having root in one of the N ini- 
tial proteins and growing in parallel. That is, the unique 
protein labelled by t has a fixed probability rij /N of be- 
longing to the tree rooted in a protein of length £ 2 . 
We may now introduce the matrix notation 

W{£\£') = [W] u , 
P{l,t\l')=[P{t)] n , 
Q(£,t\l')=[Q(t)] u , 

which allows to write the evolution equation for P(t) 
more compactly as 

1 * 

P(t + l) = WQ{t) = -J2 wp ( s ) (9) 

s=l 

Or equivalently as 

t-i 

tP{t + l) = Y^WP{s) + WP(t) = (t + W-l)P{t) (10) 



which has the formal solution 



i-l 



w-i\ w 1 - 1 



where z n stands for the so-called raising factorial product 
z(z + 1) . . . (z + n — 1) ll20h . The raising factorial gener- 
ates the (unsigned) Stirling numbers of the first kind as 
coefficients of its expansion in simple powers of z: 



fc=l 



where we adopted the square bracket notation of 
ref. ll20h for the Stirling numbers. Hence from eq. (II 1 D 
we can write 



(12) 



where P RW (t) = W* is the formal solution of the stan- 
dard Random Walk. Notice that, by eq. (O, P(t) is in- 
deed properly normalized, that is 

CO 

1=1 

since W s is also a stochastic matrix and Stirling numbers 
satisfy the normalization 



fc=i 



In fact, eq. (I12D shows that the quantity 



Ps(t) 



(i-l)!L s 



t - 1 



has the interpretation, for the abstract RRT, of probabil- 
ity that the node added at time t is at a chemical dis- 
tance s from the root of the tree, that is the original node 
present at t = 1. In terms of proteins, p s (t — N + l)nj/N 
is therefore the probability that the t— th protein is ob- 
tained through s changes from one of the i\f initial pro- 
teins of length £j . 

Notice also that the evolution equation (O allow to 
write an alternative expression for Q(t) which is local in 
time (but generally non local in "space") w.r.t. P[t) 

Q(t) = W^P{t + 1) = A £ [*] W- 1 (13) 

s=l 

One can see that p s +i{t + 1), which by construction sat- 
isfies 

1 * 

p,+i(f + 1) = -]T> s (fc) 
fc=i 

represents the average fraction of nodes at a distance s 
from the rootj^l. 
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For very large n we can use the approximation 



(14) 



which follows from Euler's infinite product represen- 
tation of the T function §2]h. From eqs. ([Tl]), (fT3l) 
and (fl4l ) we then find 



<3(0 



exp[(W - l)logi] 

r(w + i) 



(15) 



where we neglected all inverse powers of t in the ex- 
ponent, relying for uniformity on the boundedness of 
its spectrum of W. The crucial point of eq. (I15D is the 
very slow logarithmic dependence on time, which ap- 
pear evident upon comparison with the formal solution 
W l = exp(ilogVK) of the Markovian case. 

In order to provide more explicit expressions for P(t) 
and Q(t) we need some special assumtion on the stochas- 
tic matrix W. We do that in the next section. 



AVERAGE PROPERTIES OF SCALE-INVARIANT MODELS 

We describe here a class of examples which can be 
treated in detail at the analytic level. These are char- 
acterized by the assumption that our stochastic process 
is (almost) scale invariant. Intuitively, one expects that 
longer proteins can be changed throughout evolutions 
more easily than shorter ones. Exact scale invariance 
would mean that changes are proportional to length. 

To implement this picture, we first extend the lengths 
I from the positive integers to all real positive numbers. 
It will become apparent in the sequel that this exten- 
sion has very little impact on our conclusions. Next we 
change variables, from I to its logarithm x = logi and 
assume homogeneity in x, namely that 

W{l\l') dl = W{e x \e x ') d{e x ) = W{x - x') dx 

is translation invariant, i.e. function only of x— space dif- 
ferences. The process is very simple: at each time step 
the random walker may pick anyone of the previously 
visited points as starting point for the next step, whose 
value x is extracted with the one-step pdf (probability 
distribution function) W[x). In terms of protein lengths, 
at each step the length is rescaled by a factor e x . Since 
the true variables are discrete, we may take W[x) to be 
very smooth for all x. Likewise, since I > 1 we may take 
W(x) to vanish very quickly (let us say "faster than any 
power") for x — > — oo. For x — > +oo we assume instead 
quite reasonably that W(x) vanishes fast enough to have 
finite moments at least up to order four. We then intro- 
duce the follwing notation for the first two cumulants: 



/i = dxx W(x) , a 2 



dx {x - n) 2 W{x) 



that is the mean value and the squared fluctuations. 
We can now define the process probability in x— space 

as 

V{x-x',t) =e x P{e x ,t\e x ') 

and in the same way we can introduce the average dis- 
tribution Q{x, i) which by eq. ([5]) satisfy 



Q{x,t) = -Y j V{ 



x,s) 



s=l 



Since the stochastic matrix which correspond to W(x — 
x') is diagonal in Fourier space, we can now write the 
formal expression eq. (II 1 D as 



r dk ~ 
V(x,t)= / — e * kx V(k,t) 



(16) 



where 



t-i 



V(k,t) = \ 



s=l 



1 + 



W{k) - 1 



and W(fc) is the Fourier transform of W(x). Clearly, by 
eq. (1131) . the Fourier transform of Q(x, t) reads 



2(M) 



V{k,t + 1) 
W{k) 



The correct normalization of either V{x,i) or Q(x, i) fol- 
lows from that of W(x), which implies W(0) = 1. Other 
consequences of the probabilistic nature of W(x) is the 
symmetry VV(fc)* = VV(-fc) and the bound \W{k)\ < 1. 
In addition, with the natural requirements made above 
on the one-step pdf W(x), the function W(k) has the 
expansion near k = 



W(Jfe) ~ 1 - ifik - \{v 2 + a 2 )k 2 + ... 



(17) 



and vanishes for large 

In this context of a continuous logspace, the extension 
to a generic initial distribution is very simple: it amounts 
to multiply both Fourier transforms V(k,t) and Q(k,t) 
by the Fourier transform of the initial distribution. 

Using eq. (115ft we have for large t 



Hx,t)= / — 



and 



dk £lkx exp[(WW-l)log*] 

2?r r(w(fc)) 



Q{ )= fjt e , h eipPW-l)lo gt l 

J 2tt r(w(*) + i) 



(18) 



(19) 



up to fully negligible inverse power corrections in t. For 
any given W(fc) the Fourier integral in eqs. fl 18D and fl 19D 
can be computed numerically to high accuracy through 
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Fast Fourier Transform. Moreover, for large t we can de- 
rive very similar asymptotic expansions in inverse powers 
of logf valid for any W(k) in the class described above. 
Since our main interest is in the average distribution pro- 
file Q{x, t) we concentrate on this our attention. 

The leading term for large t is determined only by the 
first two terms of the W{k) expansion Q17D near k = 0, 
with the quadratic term providing the Gaussian domi- 
nance in eqs. (I18D and ( |19t accordig to the central limit 
theorem. From the first and second derivatives in k = 
of Fourier transform Q{k, t), we first compute the mean 
value and standard deviation of the process for large t, 
as 



fit = (x) t = /i(logi + 7 - 1) 
a 2 = ({x - fi t ) 2 ) t 

= Ou 2 +£7 2 )(logi + 7- 1) 



(20) 



i)m 2 



where 7 = 0.5772156 ... is Euler-Mascheroni constant. 
Then in terms of the standard centerd scaled variable 



£ = £(M) 



we have to leading order 



at 



-e/2 



(21) 



Going back the length variable £ through the definition 
x = log(£/£'), we find the lognormal distribution 



Q(W) 



-[log^-log(^*)] 2 /(2S t 2 ) 



£ V^TTCT 2 



(22) 



peaked around the rescaled initial length £' e^ t_<T * . 

Subleading contributions to the above results, of rel- 
ative order l/^/Togi and smaller, at fixed values of £, 
can be computed by the standard perturbation technique 
around Gaussian integrals: one includes also terms of 
order higher than k 2 , say of order n > 4, in the power 
expansion of VV(fc) around k = and then expands to 
order k n also the exponentials of such terms; for com- 
pleteness, also the expansion of the inverse T function 
must be properly extended; finally one integrates explic- 
itly each term of the complete expansion in terms of mul- 
tiple derivatives of the leading Gaussian. One obtains in 
this way a n— degree polynomial in £ times the Gaus- 
sian e - ^ I 2 . The n + 1 coefficients of the polynomial 
are fixed by the first n + 1 moments of the distribution, 
which in turn can be computed directly from the Tay- 
lor series in k = of the Fourier transforms Q(k, t) or 
V(k,t) (by construction, we must impose (£) t = and 
(£ 2 ) t = 1 for the first two moments). Taking into ac- 
count the specific form of these Fourier transforms, it is 
more convenient to calculate the cumulants of Q{x,t) or 
V(x,t), since their n— order cumulant is given by logt 



times the n— order moment of the one-step pdf W(x), 
plus the n— order derivative w.r.t. k of log[r(>V(A;) -I- 1)] 
or log[r(W(fc))] evaluated at k = 0. Moreover, the latter 
contributions are systematically subleading as compared 
to the moments of W(x), so that we have, for the third- 
order and the fourth-order cumulant of Q{x,i) (that is 
the average skewness s t and kurtosis R t of the process, 
up to normalization conventions) : 



h = (eh 



M3 



«t = (f )t ~ 3 



$ 2 V^gt 
fl 4 1 



H 2 logt 



l + O 

1 + o 



log* 
1 

bit 



(23) 



where fi 3 and /i 4 are the third- and fourth-order mo- 
ments of W(x), while /j, 2 = ^ 2 + a 2 is the analogous 
notation for the second moment. In this expression one 
may regard the expectation values as evaluated with 
V(x, t) rather than with Q(x, t), since the differences are 
due solely to the change T(W{k)) -> V(W(k) + 1) from 
eq. dl8D to eq. Q19D and are subleading. We can now rec- 
ognize a distintive mark of the RRT over the real line: for 
sufficiently large time, the kurtosis of the average distri- 
bution profile is certainly positive, since positive definite 
is the fourth moment of any W(x). Another important 
characteristic, which will be further discussed later on, is 
the positivity of the ratio between the skewness {£ 3 )t and 
the third moment fj, 3 of W(x). 

The extension of the main results, eqs. (I20D - (I23D . to 
the case of a generic initial distribution are straightfor- 
ward. In particular, to the cumulants of Q{x,t) one 
would have to add the cumulants of the initial distribu- 
tion, which are constant in time and therefore sublead- 
ing. Thus eqs. (I20D would get additive constants and eqs. 
(I23D would hold unchanged. This is the standard way to 
see how the process forgets abouts the initial conditions 
(in a logarithmic slow way) . 



PROFILE FLUCTUATIONS 

Let us assume that, for a given stochastic matrix W 
and initial distribution Q{l,N), we can explicitly com- 
pute P(£, t) and Q(£, t) at least for large t, as in the pre- 
ceding section. To compare the result to the length distri- 
bution in a single evolution history, or very few of them, 
which is indeed our case, we need to gather information 
also on the fluctuations of the profile of the length distri- 
bution from one history to the other. 

Typically, one would like to rely on the law of large 
numbers. For ergodic Markov chains (with finitely many 
possible events) this law states that the probability that 
the frequency of a certain event in a given history differs 
from its equilibrium probability by any nonzero amount 
vanishes when the history becomes infinitely long. In 
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our case the elementary events are the observed pro- 
tein lengths and the frequency in a given history is just 
the profile of the length distribution in a given evolution 
history. The quantity Q[£,t) discussed above is just the 
expectation value of the profile, that is its average over 
all possible histories. In a Markovian setup with finitely 
many possible events there would be no difference be- 
tween the profile of a specific history and its expectation 
value in the t — > oo limit, which means vanishing pro- 
file fluctuations in the limit and negligible ones for suffi- 
ciently large t. The stochastic process at hand, however, 
is not Markovian, having the (very specific, simple and 
itself random) RRT form of memory and has a number 
of possible events in principle arbitrarily large. In such 
case we expect, thanks to stronger forms of law of large 
numbers like the central limit theorem (and have indeed 
verified in the example class of the preceding section), 
that the average length distribution Q[£,t) assumes, for 
t large enough, a universal nonconstant form which de- 
pends only on very general properties of W. 

What we need then is also that the fluctuations of the 
frequency for large t do not spoil completely the profile 
of its expectation value Q(£, t). Notice, for instance, that 
this is not true for random walks, not even when they 
are recurrent (as generally true in one dimension, which 
is our case). In other words, in the standard RW the fre- 
quency of times the walker visits any given small region 
keeps fluctuating strongly from one very long history to 
the other, never resembling the mean frequency profile. 
This is due to the characteristic dispersion of order \ft of 
the RW, which implies that each elementary event occurs 
an insufficient number of times of order \j\ft to guaran- 
tee a good convergence of the frequency along a single 
history (it would be even worse in d > 1 dimensions) . 

On the contrary, the random memory of the RRT dra- 
matically helps the application of the law of large num- 
bers, since the logarithmic time dependence leads to 
much slower drift and diffusion, strongly reducing the 
inpact of fluctuations on the length distribution. One 
could say that the length distribution is an almost "self- 
averaging" array of random variables, which for suffi- 
ciently long time does not differ too much from its ex- 
pected value. Indeed, at least in the case of the abstract 
RRT, there exist mathematically rigorous theorems about 
the convergence of the chemical distance profile of any 
RRT towards a normal form Oh. In this section we pro- 
vide some quantitative numerical evidence of the same 
property for lengths distribution profiles using a specific 
model for W{l\l'). 

We first revert to the realistic situation of lengths as 
positive integers not smaller than some lower cutoff 
^min > 1; next we consider the following RRT process 
(written as computer pseudocode) 

£ = integer part of e x £(nt); 
\f£> Imin then l(t + l)=l 



where n t is an integer chosen at random from 1 to t 
and x is extracted with the one-step pdf W(x) over the 
continuous logspace; finally we pick for W(x) the max- 
imum entropy form compatible with our general setup, 
namely a Gaussian with mean /j, and standard deviation 
a. This minimum bias choice could even be regarded 
as natural in view of the many different "microscopic" 
and "macroscopic" mechanisms on which the stochas- 
tic process should depend, as discussed in the Introduc- 
tion. However, we make it here mainly for numerical 
definiteness. In any case the analysis of the preceding 
section and the discussion below, at the end of this sec- 
tion, should make it clear that other choices of W(x) in 
the same class would lead to relative changes that van- 
ish as 1/ log t, while preserving important characteristic 
properties like the positivity of the kurtosis. 

It is quite easy on modern personal computers to ac- 
curately simulate the process J24D by running many very 
long random histories. In our simulations, we produced 
10 5 length distributions with the discrete time t running 
from N < 50 to N = 5 • 10 6 . For sake of definiteness 
we started from 25 initial lengths chosen at random from 
30 to 50 and set /j, = 0.16 and a = 0.19. This setup was 
determined in such a way to fit the overall scales of the 
length distribution in the SIMAP database, as will be dis- 
cussed in the next section. In particular, it turns out that 
the effects on the profiles of the lower cutoff £ m ; n are 
fully negligible, so that the scale-invariant framework 
adopted in the preceding section should apply. Indeed, 
one can also check that the discreteness of the lengths 
£(t) does not play any significant role at all w.r.t. the 
continuous case. 

For each distribution we computed the mean and stan- 
dard deviation in the variable x = logi: 

tx t = - t Y.x{j), at = - t jy{j)-n t f (25) 

at prescribed intermediate values of t. Likewise, we com- 
puted the skewness and kurtosis 

st = \jze{j), «t = -3+i^e*cj) (26) 

3=1 3=1 

where as usual = [x(j) — nt]/^t- 

These four parameters are still random variables which 
fluctuate from one RRT to the other. Moreover, except 
for the mean \xt, their average values over all possible 
RRT realizations of t steps do not coincide with the corre- 
sponding parameters of the average distribution Q(x,t), 
since such average values receive contributions also from 
the profile fluctuations. Only the average (fj, t ) is given by 
the quantity p, t in the first eq. (120D . The differences be- 
tween (at), (st), («t) and d t , s t , R t in the second eq. (1201) 
and eqs. (I23D . respectively, cannot be even extimated 
with the help of Q{x,i) alone. This is true a fortiori 
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for the fluctuations. Therefore it is important to provide 
some (numerical) evidence on their behaviour for large 
times. In particular, s t and K t , provide a measure of the 
deviation from gaussianity of a given profile (we refer to 
above-mentioned tmathematical literature for some rig- 
orous bounds in the case of abstract RRT's) . 

We also kept track of all the logspace profiles, after a 
suitable coarse graining: we fixed beforehand a uniform 
binning grid of K intervals of width li« 1 over a por- 
tion of the real line large enough to contain almost all 
£ points produced (e.g. the interval (—5, 5) to comprise 
all points within 5 sigmas); then we computed the frac- 
tion q k of £ points in a given RRT that fall in the fc-th 
interval of the grid. At this stage using continuous or 
discrete lengths does make a difference, since a binning 
grid too fine over the logarithms of integer lengths will 
induce spurious fluctuations. Hence in the discrete case, 
for each integer j repeated n 3 times in a given length 
distribution we filled the real interval (j — 1/2, j + 1/2 
with rij double precision lengths chosen at random; only 
after this smoothing we computed the distribution over 
the regularly space grid in logspace. 

By construction, the average of the discretized density 
1k{t) /h over all possible histories will reproduce the inte- 
gral of the average profile Q(x, t) as a function of £ over 
the fc-th interval of the grid. Then an extimate of the 
profile fluctuations is the standard deviation of q k (i)/h 
for each k. 

With q k (i) we computed another important (and more 
robust) measure of deviation from gaussianity, that is the 
entropy: 

N 

S t = log h + q k (t ) log q k (t) 

fc=i 

In fact, in the t — ¥ oo limit of an infinite RRT and then 
h — > of vanishing grid width, a Gaussian profile for £ 
would have maximal entropy equal to (log 2-k + l)/2 = 
1.41893853.... 

In fig. Q] we show the evolution of the logarithmic 
length distribution along a single history and, for com- 
parison, the evolution of the average profile Q(x,t) ob- 
tained by numerically integrating through FFT eq. J19D 
and superposing the results as in eq. (0 . 

In fig.|2]we show the distributions of the statistical ex- 
timators defined above for few values of t equally spaced 
in logspace. We see that the parameters that measure 
deviation from gaussianity, that is s t , K t and St, have 
mean values that slowly tend to the Gaussian values with 
smaller and smaller fluctuations as t — >• oo. The conver- 
gence behaviour is roughly the ubiquitous one, 1 / log t, 
with variances that vanish faster than the peaks move- 
ment. Also the variance of the standard deviation seems 
to slowly converge. On the other hand the fluctuations 
of the mean do not appear to converge at all; this is re- 
flected in the reduction slower than 1 / log t of the stan- 




2468 10 2468 10 



FIG. 1 : Evolution of the logspace length distribution in a spe- 
cific history (left panel) and on average (right panel), starting 
from the same initial conditions. 



dard deviation of the £ profile fluctuations. In table [J we 
provide further numerical evidence through the standard 
deviations over the 10 5 sample histories of /j, t , at, s t , K t 
and S t . 





standard 






deviation 


A A A 







4 4.5 5 5.5 6 6.5 0.5 0.6 0.7 0.8 0.9 1 




1.34 1.36 1.38 1.4 1.42 -4 -2 2 4 



FIG. 2: Distributions over 10 5 sample histories of the indicated 
statistical extimators of the logspace length profile for the same 
times as in fig. [T] 



t 




Aa t 


As t 


AK t 


AS t 


5 • 10 3 


0.0781 


0.0401 


0.1425 


0.3840 


0.0131 


5- 10 4 


0.0784 


0.0343 


0.0927 


0.2282 


0.0072 


5 • 10 5 


0.0787 


0.0304 


0.0647 


0.1440 


0.0045 


5- 10 6 


0.0785 


0.0277 


0.0482 


0.0980 


0.0031 



TABLE I: Standard deviations over 10 5 samples of the statistical 
extimators of mean, standard deviation, skewness, kurtosis and 
entropy of the logspace length distribution. 
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These results remain qualitatively unchanged under 
generalization from the Gaussian one-step pdf chosen 
above to a generic W(x) of the class discussed in the pre- 
vious section. For fixed and a only t— independent nu- 
merical variations appear due to the change in the higher 
moments of W{x). In particular the skewness can be 
made to assume prevalently positive or negative values 
by choosing a W(x) with positive or negative third mo- 
ment (while keeping the first moment \x > 0), while the 
kurtosis distribution remain always peaked around posi- 
tive values, with a variance which appear to vanish faster 
than the mean. This is in agreement with the properties 
of the average length distribution as given by eq. Q23D . 

In summary, very large RRT's over the space of possible 
protein lengths are indeed almost auto-averaging objects 
and it is sensible to compare the average properties of the 
random process to a few, or even a single, realizations of 
it. 



COMPARISON WITH THE OBSERVED LENGTH 
DISTRIBUTIONS 

To test our simple model we compare here the pre- 
dicted length distributions with the real length distribu- 
tions of proteins observed in nature. In this last decade 
the number of known protein sequences has been rapidly 
growing and is still growing now at a steady pace. A huge 
number of protein sequences coming from very many dif- 
ferent species are now stored in various databases. 

In particular the SIMAP JH] (Similarity Matrix of 
Proteins, http: / /mips.gfs. de /genre /proj/simap) database 
collects about all amino acid sequences from public 
databases and completely sequenced genomes. On 
September 2006 it was storing more than 5.5 millions of 
not-redundant proteins coming from more than 100000 
different species. 



kingdoms 


number of 
species 


number of 
proteins 


bacteria 


11130 


2217301 


viruses 


14631 


319885 


plants 


31232 


1156929 


animalia invertebrates 


25951 


383760 


vertebrates 


19341 


772605 


environmental 
samples 


1453 


32591 


synthetic 


822 


14660 



TABLE II: Number of species and proteins for each kingdom in 
SIMAP on September 2006. 

We report in Table [II] a coarse subdivision of all SIMAP 
proteins and their corresponding species in five (non- 
standard) main kingdoms: bacteria, viruses, plants, in- 
vertebrates (animalia) and vertebrates (animalia). In fig. 



[3] we provide plots of the corresponding length distribu- 
tions. 




500 1000 1500 500 1000 1500 



protein's length 

FIG. 3: Length distributions of SIMAP proteins. Each box shows 
an enlargement of the (not normalized) length distributions of 
proteins coming from all kingdoms ({1} = 335, lmax = 38031), 
bacteria ((I) = 316.9, lmax = 36805), viruses ((I) = 273.9, 
l max = 7312 ), plants ((I) = 314.5, l max = 20925), inverte- 
brates ((/) = 416.1, lmax = 23015), vertebrates ({1} = 397.1, 

lmax = 38031). 

One can see that all SIMAP distribution profiles have 
a global similar shape, with a well defined overall posi- 
tion and scale. There are however also large fluctuations 
on smaller scales. In particular the curves of viruses, 
invertebrates and vertebrates show very high and nar- 
row peaks in correspondence to certain specific values 
of length. Of course, on general grounds, our model 
is too simple and generic to make predictions on other 
than global properties of the profiles, so we should re- 
strict to the lowest moments or cumulants of the distri- 
bution and perform robust coarse graining on the data 
for more refined analysis. We believe, in any case, that 
these peaks are to a large extent spurious, being due to 
an over-representation in the SIMAP database of those 
particular protein lengths. SIMAP in fact contains a lot of 
proteins which not necessarly come from completely se- 
quenced genomes: this fact makes the collection of pro- 
teins not homogeneous over the species present in the 
database and so it is possible that certain peculiar lengths 
are more represented since they correspond to proteins 
of many more different species than other lengths. If 
the collection were homogeneous over the species, we 
would expect length distributions without high narrow 
peaks and also less fluctuating in general. At any rate, 
we verified that the global analysis reported below is al- 
most insensitive to the removal by hand of the high and 
narrow peaks. 

The SIMAP database provides a very large sample of 
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real proteins which can be assumed to be statistically 
significant. We believe therefore that it is sensible as a 
testbed for our model and we make the basic assump- 
tion that the SIMAP length distributions for different 
kingdoms as (almost) independent realizations of our 
stochastic process. The motivation is that different king- 
doms have been going through almost independent evo- 
lutional histories since a long time and, even if one can- 
not forget that far enough in the past there was non dis- 
tinction at all, the main characteristic of the stochastic 
process of forgetting the initial conditions suggests that 
at most a negligible trace remain of the common remote 
past in each kingdoms distribution. 

In table [HI] we list the measured values of the mean, 
standard deviation, skewness, kurtosis and entropy of 
the logarithmic length distributions for the five kingdoms 
separately and for the comulative all kingdoms distribu- 
tion. Except for the entropy, these parameters can be 
computed directly from the statistical extimators as in 
eqs. (I25D and (1261) without any coarse graining. To com- 
pute the entropy we performed a coarse graining in the 
logspace with the same procedure of the preceding sec- 
tion. One can see that the kurtosis is always positive, 
in accordance with the average property of the model 
[eq. (I23D 1 and with the results of the previous section 
on the fluctuations. We also notice that the cumulative 
kurtosis is definitely higher than the individual ones, due 
to the fluctuations in the lower cumulants. Again, this 
is consistent with the interpretation of the kingdoms dis- 
tributions as independent realizations of the process. Ex- 
cept for the viruses, the entropy is always close to the up- 
per Gaussian limit, with the plant distribution the closest 
to a normal form. 



of W(x), /j, and a, a specific large value of t and values of 
the initial lenghts in the range 30 — 50, one can produce 
simulations with eq. J24D , like those reported in fig. [21 
whose distributions profiles fit well the peak positions 
and the sizes of the SIMAP length distributions. 



kingdom 


mean 


s.d. 


skewness 


kurtosis 


entropy 


bacteria 


5.53 


0.68 


-0.20 


0.32 


1.408 


viruses 


5.26 


0.79 


0.30 


0.53 


1.297 


plants 


5.44 


0.78 


-0.01 


0.04 


1.414 


invertebrates 


5.65 


0.87 


-0.03 


0.31 


1.409 


vertebrates 


5.60 


0.89 


-0.18 


0.25 


1.394 


all kingdoms 


5.49 


0.81 


-0.26 


0.63 


1.406 




TABLE III: Global statistical indicators of the SIMAP length dis- 
tributions in logspace. 



In fig. [4] we show the Gaussian fits of the length dis- 
tributions. As expected from the data in table [nil these 
fits appear rather good, apart from fluctuations, which 
are more important when the entropy is lower, that is 
for viruses and vertebrates. Explaining these fluctuations 
is beyond the scope of our model. Moreover, one must 
remember that the SIMAP database is incomplete and, 
as discussed above, probably biased toward particular 
species; these features contribute to local irregularities. 

With fine-tuned choices of the two main parameters 



logarithmic protein's length 

FIG. 4: Length distributions of SIMAP proteins in the logarith- 
mic lengths space. These (not normalized) distributions have 
been obtained through a uniform gridding with 200 intervals 
over x = log£. Before the change of variable from I to x, 
we scattered the protein length values from integer to real to 
avoid the introduction of spurious fluctuations in the logarith- 
mic space. 

Our choice for the initial distribution is based on 
the quite natural assumption that today's proteins are 
evolved from shorter peptide ancestors 0, l25Tl . In any 
case, according to the model, t— independent changes in 
the initial distribution might affects the final distribution 
only by terms of relative order 1 / log t. 

We remark also that in our purely probabilistic frame- 
work no fine quantitative check can be performed, for 
some good reasons. 

Firstly, even assuming that for each kingdom the pro- 
teins in the database constitute a statistically significant 
fraction of the total existing in nature, we do not know 
what the total number might actually be. So we can- 
not fix the precise value of the total discrete time of the 
stochastic process. This does not lead to large uncer- 
tainties, though, since the evolution is only logarithmic 
in this discrete time. 

Secondly, the one-step pdf W(x) governing the model 
can hardly have any precise quantitative relation with 
the moltitude of microscopic and macroscopic effects 
that drive the evolution. So one could not ascribe any 
particular value to a specific functional form of W(x) 
whose most relevant free parameters were determined 
from data fitting. Rather we must restrict to very general 
properties valid for a wide class of one-step pdf's. This 
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argument applies also to the lowest moments of W(x), 
that might very well differ in distinct independent pro- 
cesses. 

Let us examine therefore in more detail to what extent 
the model agrees with the observed distributions. 

First of all, according to the model, the length dis- 
tribution of a large set of protein belonging to a single 
long evolutional history must be almost a Gaussian in 
logspace, that is a lognormal over the lengths. As we 
have just seen, this agrees quite well with the SIMAP 
distributions. The approximate lognormality of protein 
sizes was observed several years ago on much smaller 
data sets Hit . 

Next there are the two scales of the fluctuations of the 
two parameters of a Gaussian, namely the mean and the 
standard deviation, which were denoted as /j, t and a t in 
the previous section. Once a process simulation is fine- 
tuned to produce the correct average values of fj, t and a t , 
their fluctuations have a scale which depends mildly on 
the higher moments of W(x), does not depend on t in the 
case of jj, t and depend at most as 1/ log t in the case of a t ■ 
These scales agree fairly well with those observed over 
the SIMAP kingdom distributions [see tables U and [TTTTl , 
at least when W(x) has skewness and kurtosis not too 
large. Notice that also the lower cutoff £ min on the pos- 
sible lengths acts as constraint on the higher cumulants 
of W{x). For instance, an excessively negative skewness 
in W(x) would typically lead to large left tails also in 
the length distributions which are then abruptly cutoff at 
imin', such abrupt cuts are absent in the observed data, 
as evident from fig. [4) 

Then there are the systematic deviations from gaus- 
sianity. The model predicts a positive kurtosis for any 
W(x) and the SIMAP data agree very well with this. 
Also the entropy is very close to the expected values, 
except for the viruses, whose protein distribution is the 
least abundant, has the smallest mean length and the 
largest relative fluctuations. However, the data in ta- 
ble [m show also always negative skewness, again except 
for the viruses. This characteristic cannot be accounted 
for too easily in the model. 

Indeed, as we have already noted, it is natural to as- 
sume that the average protein length has been growing 
in time. This requires that the first moment \i of the one- 
step pdf W(x) is positive, that is that a positive shifts in 
x— space prevails over negative ones. Now, from the av- 
erage relation (1231) and the numerical analysis reported 
in the preceding section, a prevalently negative skewness 
requires that the third moment n 3 of W(x) is negative. 
The simplest Gaussian one-step pdf used to produce the 
results in fig.[2]has by definition /j, and /j, 3 with the same 
sign, but other pdf's with positive /j, and negative /j, 3 are 
certainly possible. However, this would mean even more 
negative skewness in VV(ai), causing problems with the 
lower cutoff £ m ; n , unless very large unrealistic values of 
the initial lengths are assumed, which in turns would typ- 



ically spoil the overall scale fitting. Thus, after all, there 
is tension on the model in spite of its many free parame- 
ters. 

Our simple model for the protein length distributions 
is based on RRT embedded in the natural numbers, with 
the assumption of almost scale invariance for the tran- 
sition probability W(£\£'). By this we mean that the 
stochastic process is uniform in continuous logspace, 
with translation invariance broken only by length dis- 
creteness and the lower cutoff 4nin, in both cases with 
negligible effects. This is an idealization suggested by 
simplicity (it translates the intuitive idea that longer pro- 
teins can change more than shorter ones throughout evo- 
lutions) and ease of analytic investigation of average 
properties. It has some difficulties in accounting for neg- 
ative skewness (viruses apart), but it is in overall good 
agreement with observations, especially for the positive- 
ness of the kurtosis. This suggests to keep to a minimum 
the modifications in more realistic models for W{l\l'). 
We describe one minimal change in the next section. 



INTRINSIC SMOOTH CUTOFF ON LARGE LENGTHS 

In the stochastic framework we have considered up to 
now, the vanishing of the length distribution for large I is 
determined by the slowly drifting and diffusing charac- 
ter of the process with the assumption of relatively small 
initial lengths. 

On the other hand there are physical reasons to ex- 
pect that very long proteins are intrinsically less proba- 
ble than shorter ones, in the sense that the "microscopic" 
mechanisms that determine, upon countless repetitions, 
the production of longer and longer proteins are eventu- 
ally limited by simple stability criteria: very long pro- 
teins, to be stable against thermal fluctuations in the 
natural environment, must fold in the biologically ac- 
tive form more "tightly" than shorter ones, as could be 
measured by their growing spectral dimension ll26h ; but 
this requires more and more complex stereoscopic order- 
ings while the building blocks (amino acids at the lowest 
level and larger strutures at the second and third level) 
are limited in number and typology. We could there- 
fore expect some form of smooth cutoff on long lengths, 
parametrized by a stability scale l s . 

The minimal change on the model, as anticipated 
above, could therefore be the following: 



w(e\e') = r 1 g(e/e s )w{io g (e/e')) 



(27) 



where W(x) is the usual one-step pdf in logspace and 
g(u) a smooth function which is almost constant for u < 
1 and monotonically decreases to zero for large u. 
The random recursion corresponding to eq. 



is a 
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simple modification of eq. ( [241 ) 



I = integer part of e x l{nt)\ 

if I > l min and r < g{l/l s ) then l(t + 1) = I 



where, as before, n t is a random integer from 1 to t and x 
is extracted from W(x), while the new random number 
r is extracted uniformely in the interval (0,<7 max ), with 
g max = g^irain/is) the assumed largest value of the func- 
tion g. 

Another possibility could be to introduce an explicit 
I— dependence in W(x), in such a way that length re- 
ductions (x < 0) become more probabale than length 
growths (a: > 0) for large enough lengths. In this case 
the random recursion would be the same of eq.. Q24D ; 
the only change is that x is extracted in a weakly non 
scale-invariant way from a one-step pdf W(x; £). 

Once some specific form for W(x) and g(u), or for 
W(x; I), is chosen, simulations with weakly broken scale 
invariance can be performed as easily as before. Since 
there are now more tunable parameters, it is almost ob- 
vious that data fitting can be improved. From a purely 
quantitative point of view this better fits have little signif- 
icance. Instead we want to stress the main new qualita- 
tive aspects: the smooth cutoff typically induces shorter 
right tails in the simulated distribution, thus slightly re- 
ducing both skewness and kurtosis. If the one-step pdf 
has the right characteristics, it is possible to obtain al- 
most always length distributions with still positive kurto- 
sis but negative skewness after a few million steps. The 
cutoff prevents the formation of proteins too long, thus 
allowing to reproduce the observed mean length and 
length variance Typically l s , which by construction pro- 
vides the scale of the rightmost tail, need to be choosen 
between 5000 and 10000, depending on other details of 
the model. 

It is also interesting to observe that the positive skew- 
ness of length distribution of the viruses does not con- 
stitue a problem to the above scenario, since the overall 
size of this distribution is smaller than the others and 
might very well be too small to feel the effects of the 
smooth cutoff on higher lengths. 

CONCLUSIONS AND OUTLOOK 

In this work we have described a simple stochastic 
framework for the theoretical modeling of the evolu- 
tion of protein lengths. It is based on the idea of Re- 
cursive Random Trees over the set of natural numbers. 
RRT's represent the simplest formal implementation of 
the main feature of the evolution process: new biological 
material is produced through modifications of the biolog- 
ical material already existing. In the case of proteins the 
full space over which the RRT grows is that of all amino 
acid sequences, but it can be reduced to more tractable 



spaces when only specific observables are considered, as 
is the case of the protein lengths. 

Of course, the details of the stochastic process, as en- 
coded in the conditional probabilities, are practically out 
of reach, due to the moltitudes of natural causes ranging 
from biochemical interactions to selection mechanisms 
in varying environments. The relevance of the stochastic 
framework is based therefore on the concept of univer- 
sality; namely that, under the law of large numbers, sta- 
tistical coarse grained observations tend to take universal 
forms which depend only on few fundamental features of 
the stochastic process. In the case at hand, the main fea- 
tures are the auto-averaging property of RRT's and the 
approximate scale invariance of the one-step transition 
probability; they imply the universal properties that pro- 
tein length distributions are almost lognormal, with posi- 
tive kurtosis and a specific scale for the overall deviations 
from exact gaussianity. 

There are several routes for improvements. First of 
all, the choice of RRT's (which have a uniform probabil- 
ity over all nodes of the tree for the attachment of the 
new node) is by itself an ideal simplification. In a more 
realistic setup one should consider differently weighted 
nodes in order to mimic certain aspects of the evolu- 
tion process such as selection and differentiation. Then 
there are many more observables other than the distri- 
bution length in protein databases such as SIMAR The 
global statistical analysis of the SIMAP protein homology 
network carried through in ref. 112311 shows several inter- 
esting features which deserve to be studied within some 
generalization of the stochastic process described here. 
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